Skip to content

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090

Open
jmsperu wants to merge 2 commits intoapache:mainfrom
jmsperu:feature/configurable-heartbeat-fence
Open

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
jmsperu wants to merge 2 commits intoapache:mainfrom
jmsperu:feature/configurable-heartbeat-fence

Conversation

@jmsperu
Copy link
Copy Markdown
Contributor

@jmsperu jmsperu commented May 1, 2026

Description

Fixes #13089

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

Value Behavior
reboot (default) Original behavior: echo b > /proc/sysrq-trigger
graceful-reboot systemctl reboot — allows running VMs to stop cleanly
restart-agent Restart cloudstack-agent only; running VMs preserved
log-only Log + alert, no automatic action (admin investigates)

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

Files changed

  • scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
  • scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
  • agent/conf/agent.properties — document the new property
  • agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

  • Default action is reboot, identical to current behavior
  • Property is read with tail -n 1 so duplicate entries take the last value
  • If the property file is unreadable or the value is unrecognized, falls back to reboot
  • No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before

Testing

  • reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
  • log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
  • restart-agent — verified systemctl restart cloudstack-agent is invoked
  • graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.

Related

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.

For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.

Adds a new agent.properties option:

    kvm.heartbeat.fence.action = reboot | graceful-reboot
                               | restart-agent | log-only

Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:

  - graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
    a chance to shut down cleanly
  - restart-agent: restart cloudstack-agent only, preserving running VMs
  - log-only: log + alert, no automatic action

The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.

Refs: apache#13089
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.51%. Comparing base (30e6c22) to head (f5f3db5).
⚠️ Report is 206 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (30e6c22) and HEAD (f5f3db5). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (30e6c22) HEAD (f5f3db5)
unittests 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##               main   #13090       +/-   ##
=============================================
- Coverage     17.92%    3.51%   -14.41%     
=============================================
  Files          5939      464     -5475     
  Lines        533181    40154   -493027     
  Branches      65237     7559    -57678     
=============================================
- Hits          95585     1413    -94172     
+ Misses       426856    38551   -388305     
+ Partials      10740      190    -10550     
Flag Coverage Δ
uitests 3.51% <ø> (-0.16%) ⬇️
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

@DaanHoogland
Copy link
Copy Markdown
Contributor

@jmsperu , would you consider this for older versions as well?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

  • Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
  • Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
  • Documents the property in agent.properties and adds a AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
scripts/vm/hypervisor/kvm/kvmheartbeat.sh Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures.
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh Same fencing configurability for StorPool heartbeat script.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation.
agent/conf/agent.properties Documents kvm.heartbeat.fence.action for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +314 to +316
# write fails persistently. Supersedes the legacy binary
# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
#
Comment thread agent/conf/agent.properties Outdated
# reboot - immediate sysrq-trigger reboot (default; original behavior)
# graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly
# restart-agent - restart cloudstack-agent only; running VMs are preserved
# log-only - log + alert; take no automatic action (admin must investigate)
* <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li>
* <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li>
* <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li>
* <li>{@code log-only} — log + alert, no automatic action</li>
@NuxRo
Copy link
Copy Markdown
Contributor

NuxRo commented May 7, 2026

@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully.
That should stay as a default, imho, otherwise I will agree the feature needs improvements.
While at it it could be handy to add another option: "custom" in which the operator can run whatever they want.

sleep 5
echo b > /proc/sysrq-trigger
exit $?
;;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this part is generic for both, can keep it in a common script and call that with the fence action?

…me to hard-reboot

Per review on PR apache#13090:
- Refactor common fence-action case into kvmha-fence.sh sourced by both
  kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti).
- Add 'custom' fence action that invokes an operator-supplied script
  (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/
  heartbeat-fence-custom.sh) with the heartbeat script name as arg;
  falls back to hard-reboot if the script is missing/non-executable
  (per @NuxRo).
- Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep
  'reboot' accepted as alias so existing deployments don't break (per
  @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot,
  required where a stale NFSv3 mount blocks systemctl reboot.
@jmsperu
Copy link
Copy Markdown
Contributor Author

jmsperu commented May 9, 2026

Pushed f5f3db5ea3 addressing all review comments:

@sureshanaparti — extracted the shared fence-action case into a new scripts/vm/hypervisor/kvm/kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh. Each caller now does:

. "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh"   # script name passed for log tagging

Net -40 lines of duplication.

@NuxRo — added custom action. Operator script path is configurable via kvm.heartbeat.fence.custom.script (default /etc/cloudstack/agent/heartbeat-fence-custom.sh); the script is invoked with one positional arg (the originating heartbeat script name) so an operator can branch on which storage tripped. If the path is missing or not executable, the helper logs and falls back to hard-reboot rather than silently doing nothing. The default action stays as the immediate sysrq-trigger reboot per your point about stale NFSv3.

@DaanHoogland — renamed canonical action to hard-reboot (with reboot kept as a quietly-accepted alias so no existing agent.properties files break). Default value updated everywhere (AgentProperties.java + agent.properties + helper).

Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk.

@sureshanaparti
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage

6 participants